Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
The key common bottleneck in most stencil codes is data movement, and prior
research has shown that optimisations which improve data locality by
scheduling across loops do particularly well. However, in many large PDE
applications it is not possible to apply such optimisations through compilers
because there are many options, execution paths and data per grid point, many
dependent on run-time parameters, and the code is distributed across different
compilation units. In this paper, we adapt the data locality improving
optimisation called iteration space slicing for use in large OPS applications
both in shared-memory and distributed-memory systems, relying on run-time
analysis and delayed execution. We evaluate our approach on a number of
applications, observing speedups of 2× on the Cloverleaf 2D/3D proxy
applications, which contain 83/141 loops respectively, on the linear
solver TeaLeaf, and on the compressible Navier-Stokes solver
OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of
CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's
Knights Landing, demonstrating maintained throughput as the problem size grows
beyond 16 GB, and we carry out scaling studies up to 8704 cores. The approach is
generally applicable to any stencil DSL that provides per-loop data access
information.
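The idea of tiling across loops with delayed execution can be illustrated with a minimal sketch. This is not the OPS API or the paper's algorithm: the `LoopChain` class, its `enqueue` and `execute` methods, and the `two_stencil_chain` helper are hypothetical names chosen for illustration. The sketch assumes a one-dimensional chain in which each loop writes its own array and reads only the output of the loop queued before it, within a known stencil radius. Loops are recorded rather than run; at execution time the tile bounds of each earlier loop are skewed outwards by the radius of the loop that follows, so every tile finds its inputs already computed.

```python
class LoopChain:
    """Hypothetical delayed-execution queue for a chain of 1D stencil loops."""

    def __init__(self):
        self.loops = []  # entries: (kernel, lo, hi, radius)

    def enqueue(self, kernel, lo, hi, radius):
        # Record the loop instead of running it (delayed execution).
        # radius: how far kernel(i) reads from the previous loop's output.
        self.loops.append((kernel, lo, hi, radius))

    def execute(self, tile_size):
        if not self.loops:
            return
        count = len(self.loops)
        done = [lo for (_, lo, _, _) in self.loops]  # next index to run, per loop
        last_lo, last_hi = self.loops[-1][1], self.loops[-1][2]
        for start in range(last_lo, last_hi, tile_size):
            # Walk the chain backwards: each earlier loop must reach
            # 'radius' points past the end of the loop that consumes it.
            need = min(start + tile_size, last_hi)
            ends = [0] * count
            for k in range(count - 1, -1, -1):
                _, _, hi, radius = self.loops[k]
                ends[k] = max(done[k], min(need, hi))
                need = ends[k] + radius
            # Run this tile's slice of every loop in program order.
            for k, (kernel, _, _, _) in enumerate(self.loops):
                for i in range(done[k], ends[k]):
                    kernel(i)
                done[k] = ends[k]
        # Flush any iterations the tiles did not reach.
        for k, (kernel, _, hi, _) in enumerate(self.loops):
            for i in range(done[k], hi):
                kernel(i)
        self.loops.clear()


def two_stencil_chain(a, tile_size):
    """Apply two dependent 3-point stencils to a, tiled across both loops."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n

    def first(i):
        b[i] = a[i - 1] + a[i] + a[i + 1]

    def second(i):
        c[i] = b[i - 1] + b[i] + b[i + 1]

    chain = LoopChain()
    chain.enqueue(first, 1, n - 1, 0)   # reads only the input array
    chain.enqueue(second, 2, n - 2, 1)  # reads b within radius 1
    chain.execute(tile_size)
    return c
```

A tile size at least as large as the array reproduces the untiled schedule, for example `two_stencil_chain(a, len(a))`, while a small tile size interleaves the two loops so that values of `b` are consumed shortly after they are produced. The sketch ignores write-after-read hazards, multiple dimensions and distributed memory, all of which a production implementation would have to handle.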
Performance prediction and procurement in practice : assessing the suitability of commodity cluster components for wavefront codes
The cost of state-of-the-art supercomputing resources makes each individual purchase a lengthy and expensive process. Often each candidate architecture will need to be benchmarked using a variety of tools to assess likely performance. However, benchmarking alone provides only limited insight into the suitability of each architecture for key codes and will give potentially misleading results when assessing their scalability. In this study the authors present a case study of the application of recently developed performance models of the Chimaera benchmarking code written by the United Kingdom Atomic Weapons Establishment (AWE), with a view to analysing how the code will perform and scale on a medium-sized, commodity-based InfiniBand cluster. The models are validated and demonstrate a greater than 90% accuracy for an existing InfiniBand machine; the models are then used as the basis for predicting code performance on a variety of alternative hardware configurations which include changes in the underlying network, the use of faster processors and the use of a higher core density per processor. The results demonstrate the compute-bound nature of Chimaera and its sensitivity to network latency at increased processor counts. By using these insights the authors are able to discuss potential strategies which may be employed during the procurement of future mid-range clusters for wavefront-rich workloads.
Performance modelling of magnetohydrodynamics codes
Performance modelling is an important tool utilised by the High Performance Computing industry to accurately predict the run-time of science applications on a variety of different architectures. Performance models aid in procurement decisions and help to highlight areas for possible code optimisations. This paper presents a performance model for a magnetohydrodynamics physics application, Lare. We demonstrate that this model is capable of accurately predicting the run-time of Lare across multiple platforms with an accuracy of 90% (for both strong and weak scaled problems). We then utilise this model to evaluate the performance of future optimisations. The model is generated using SST/macro, the machine-level component of the Structural Simulation Toolkit (SST) from Sandia National Laboratories, and is validated on both a commodity cluster located at the University of Warwick and a large-scale capability resource located at Lawrence Livermore National Laboratory.